# Image-Text Understanding

| Model | Author | License | Tags | Downloads | Likes | Description |
|-------|--------|---------|------|----------:|------:|-------------|
| Qwen2 VL 7B Instruct GGUF | XelotX | Apache-2.0 | Image-to-Text, English | 201 | 1 | A quantized version of the multimodal model Qwen2-VL-7B-Instruct, supporting image-text-to-text tasks at various quantization levels. |
| Razorback 12B V0.2 | nintwentydo | Other | Image-to-Text, Transformers, Multilingual | 17 | 3 | A multimodal model combining the strengths of Pixtral 12B and UnslopNemo v3, featuring visual understanding and language processing capabilities. |
| Lava Phi | sagar007 | MIT | Image-to-Text, Transformers, Multilingual | 17 | 0 | A vision-language model based on Microsoft's Phi-1.5 architecture, combined with CLIP for image processing. |
| Llava 1.6 Mistral 7b Gguf | cjpais | Apache-2.0 | Image-to-Text | 9,652 | 106 | A GGUF-quantized build of LLaVA, an open-source multimodal chatbot trained by fine-tuning an LLM on multimodal instruction-following data; multiple quantization options are offered. |
| Llava Phi2 | RaviNaik | MIT | Image-to-Text, Transformers, English | 153 | 6 | A multimodal implementation based on Phi-2, combining vision and language processing capabilities for image-text-to-text tasks. |
| Mmalaya | DataCanvas | Apache-2.0 | Image-to-Text, Transformers | 31 | 1 | A multimodal system built on the Alaya large language model, comprising three core components: a large language model, an image-text feature encoder, and a feature transformation module. |
| Llava V1.5 13B AWQ | TheBloke | | Image-to-Text, Transformers | 141 | 35 | An AWQ-quantized build of LLaVA, an open-source multimodal chatbot fine-tuned on GPT-generated multimodal instruction-following data on top of LLaMA/Vicuna. |
| Llava Pretrain Vicuna 7b V1.3 | liuhaotian | | Image-to-Text, Transformers | 54 | 1 | A LLaVA pretraining checkpoint; LLaVA is an open-source multimodal chatbot fine-tuned on GPT-generated multimodal instruction-following data on top of LLaMA/Vicuna. |
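
Several entries above are GGUF quantized builds intended for local inference with llama.cpp. As a minimal sketch of how such a build can be run, the `llama-cpp-python` bindings load the language model GGUF together with a separate CLIP/mmproj GGUF via a LLaVA-style chat handler. The file paths and image URL below are placeholders, and it is an assumption that the specific repos listed here ship both files; quantization level (e.g. Q4_K_M) only changes which model file you point at.

```python
# Minimal sketch: run a GGUF LLaVA-style multimodal model locally.
# Paths and the image URL are placeholders, not files from a specific
# repo above; assumes the repo provides both a language-model GGUF and
# a CLIP/mmproj GGUF.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")
llm = Llama(
    model_path="./llava-v1.6-mistral-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # larger context leaves room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant that describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        },
    ]
)
print(response["choices"][0]["message"]["content"])
```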
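
For the entries tagged Image-to-Text and Transformers, the generic Hugging Face `pipeline` interface is the usual starting point. Whether each specific checkpoint in this list is pipeline-compatible is an assumption (some require custom loading code), so the sketch below uses a well-known captioning model purely as a stand-in rather than one of the models listed above.

```python
# Minimal sketch of the generic image-to-text pipeline; the model id is a
# stand-in example, not one of the checkpoints listed above.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("https://example.com/cat.png")  # accepts a URL, path, or PIL image
print(result[0]["generated_text"])
```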